1. Load the packages tidytext and tidyverse
library(tidyverse)
library(tidytext)
Step 1: The tidyverse and tidytext packages are loaded; tidyverse supplies the core data-manipulation functions and tidytext provides the tokenization tools used to tidy the text.
#read in the data
reviews <- read_csv("amazonbeauty_review.csv")
## Rows: 1150 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): reviewerID, asin, reviewerName, reviewText, summary, reviewTime
## dbl (4): helpful__001, helpful__002, overall, unixReviewTime
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Step 2: The Amazon reviews data set is read into the workspace as reviews.
Step 3: The data set has 1,150 observations and 10 variables; the reviewText column holds the review text.
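As the read_csv() message above notes, the column-type report can be silenced with show_col_types = FALSE. A minimal sketch on a throwaway temp file (the two-row demo data here is invented, not part of the Amazon set):

```r
library(readr)
library(tibble)

# Write a tiny, made-up CSV to a temp file (hypothetical data, only to
# demonstrate the read_csv() option)
tmp <- tempfile(fileext = ".csv")
write_csv(tibble(overall = c(5, 4), reviewText = c("great product", "it was ok")), tmp)

# show_col_types = FALSE suppresses the column-specification message
reviews_demo <- read_csv(tmp, show_col_types = FALSE)
reviews_demo
```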
bigrams <- reviews %>%
  unnest_tokens(bigram, reviewText, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))
bigrams
Step 4: A tidied data set is created by tokenizing the reviewText column into bigrams (two-word sequences) and filtering out any NA values.
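To see what unnest_tokens() produces here, a minimal sketch on a single invented review (toy is a hypothetical name, not part of the data set):

```r
library(dplyr)
library(tidytext)

# One invented review, just to show the bigram tokenization
toy <- tibble::tibble(reviewText = "great hand cream")

# token = "ngrams", n = 2 yields every overlapping two-word sequence,
# lowercased by default
toy_bigrams <- toy %>%
  unnest_tokens(bigram, reviewText, token = "ngrams", n = 2)
toy_bigrams$bigram
#> "great hand" "hand cream"
```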
bigrams_separated <- bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
bigrams_filtered
Count the most common bigrams. (10 points)
bigram_counts <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)
bigram_counts
Step 5: The bigrams are separated into two columns, one word each. The word1 and word2 columns are then filtered to remove all stop words, and the remaining word pairs are counted and sorted in decreasing order, showing how many times each bigram appears.
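The separate/filter/count pattern above can be sketched on a few invented bigrams (the data are made up for illustration; one pair, "the skin", contains a stop word and should drop out):

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Invented bigrams; "the skin" begins with a stop word
toy <- tibble::tibble(bigram = c("dry skin", "dry skin", "the skin", "hand cream"))

toy_counts <- toy %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%       # split into two columns
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%                    # drop stop-word pairs
  count(word1, word2, sort = TRUE)                           # tally and sort
toy_counts
#> word1 = "dry", word2 = "skin" appears twice; "hand cream" once
```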
library(igraph)
Step 6: The igraph package is loaded for building the network graph.
# filter for only relatively common combinations
bigram_graph <- bigram_counts %>%
  filter(n > 20) %>%
  graph_from_data_frame(directed = FALSE)
bigram_graph
## IGRAPH 49016fd UN-- 21 14 --
## + attr: name (v/c), n (e/n)
## + edges from 49016fd (vertex names):
## [1] sensitive --skin dry --skin curling --iron
## [4] alpha --hydrox oily --skin highly --recommend
## [7] hand --cream fragrance --free oil --free
## [10] acne --prone prone --skin buf --puf
## [13] combination--skin skin --care
Step 7: We filter for the most common bigrams, keeping only word pairs that appeared more than 20 times across the reviews.
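graph_from_data_frame() treats the first two columns as an edge list and keeps any remaining columns as edge attributes. A minimal sketch with made-up counts (not the actual review data):

```r
library(igraph)

# Hypothetical edge list: two bigrams sharing the word "skin"
edges <- data.frame(word1 = c("dry", "oily"),
                    word2 = c("skin", "skin"),
                    n     = c(25, 22))

g <- graph_from_data_frame(edges, directed = FALSE)
vcount(g)  # 3 vertices: dry, oily, skin
ecount(g)  # 2 edges
E(g)$n     # counts carried along as an edge attribute
```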
library(ggraph)
Step 8: We load the ggraph package to visualize the network in the next step.
set.seed(2017)
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), show.legend = FALSE, edge_colour = "cyan4") +
  geom_node_point(size = 1) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()
Step 9: We first set the seed for reproducibility, then visualize the network graph; because the graph is undirected, no arrows are drawn.
The graph shows the most common bigrams in the reviews data set: word pairs that occur more than 20 times and contain no stop words. In the reviews, 'highly recommend' is commonly used for endorsed products. Once the bigrams are cleaned and visualized, it is clear that the most common pairs cluster around skin and hair care, along with review terms such as 'free' and 'prone'.